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1. Introduction 

A basic task in statistics is to ascertain whether a given set of independent 
and identically distributed (i.i.d.) draws does not come from any member of a 
specified family of probability distributions (the specified family is known as the 
"model"). The present paper considers the case in which the draws are discrete 
random variables, taking values in a finite set. In accordance with the standard 
terminology, we will refer to the possible values of the discrete random variables 
as "bins" ("categories," "cells," and "classes" are common synonyms for "bins"). 
In earlier work, Perkins, Tygert, and Ward (2011b) treated the special case in 
which the "family" of distributions constituting the model in fact consists of 
a single, fully specified probability distribution. The present article focuses on 
models parameterized with a single scalar; our techniques extend straightfor- 
wardly to any parameterization with multiple scalars (or, equivalently, to any 
parameterization with a vector). 

A natural approach to ascertaining whether a given set of i.i.d. draws does 
not come from the model uses a root-mean-square statistic. To construct this 
statistic, we estimate both the parameter and the probability distribution over 
the bins using the given i.i.d. draws, and then measure the root-mean-square 
difference between this empirical distribution and the model distribution corre- 
sponding to the estimated parameter (for details, see, for example, Rao, 2002; 
Varadhan, Levandowsky, and Rubin, 1974, page 123; or Section 3 below). If the 
draws do in fact arise from the specified model, then with high probability this 
root-mean-square is not large. Thus, if the root-mean-square statistic is large, 
then we can be confident that the draws did not arise from the model. 

To quantify "large" and "confident," let us denote by x the value of the root- 
mean-square for the given i.i.d. draws; let us denote by X the root-mean-squarc 
statistic constructed for different i.i.d. draws that definitely do in fact come 
from the model. The P-value P is then defined to be the probability that X > x 
(viewing X — but not random variable). The confidence level that the 

given i.i.d. draws do not arise from the model is the complement of the P-value, 
namely 1 — P. 
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Unfortunately, the confidence levels for the simple root-mean-square arc dif- 
ferent for different models. In order to avoid this seeming inconvenience (at least 
asymptotically), one may weight the average in the root- mean-square by the 
reciprocals of the model probabilities associated with the various bins, obtain- 
ing the classic x 2 statistic of Pearson (1900); see Remark 3.1 below. However, 
with the now widespread availability of computers, direct use of the simple 
root-mean-square statistic has become feasible (and actually turns out to be 
very convenient). The present paper provides efficient black-box algorithms for 
computing the confidence levels for any model with a smooth parameteriza- 
tion, in the limit of large numbers of draws. Calculating confidence levels for 
small numbers of draws via Monte-Carlo simulations can also be practical. The 
many advantages to using the root-mean-square have been discussed at length 
by Perkins, Tygert, and Ward (2011a). 

The remainder of the present article has the following structure: Section 2 
reviews previously developed techniques utilized in the following sections. Sec- 
tion 3 details the simple statistic discussed above, expressing the asymptotic 
confidence levels for the associated goodness-of-fit test in a form suitable for 
rapid computation. Section 4 applies the algorithms of the present paper to 
several examples. Section 5 draws some conclusions. 

2. Preliminaries 

This section summarizes a previously introduced numerical method. 

The following theorem (proven in Section 3 of Perkins, Tygert, and Ward, 
2011b) expresses the cumulative distribution function of the sum of the squares 
of independent centered Gaussian random variables as an integral suitable for 
evaluation via quadratures. 

Theorem 2.1. Suppose that n is a positive integer, X\, X-x, . . . , X n —i, X n are 
i.i.d. Gaussian random variables of zero mean and unit variance, and o\, o~2, 
. . . , On-i, &n o,re positive real numbers. Suppose in addition that X is the ran- 
dom variable 

n 

x = Y / WkX k f. (i) 

fc=l 

Then, the cumulative distribution function F of X is 

e l-t git\Ai \ 

(* - r4^r) IILi 7i - 2(t - iH/x + 2ito-i^i/x) dt 

(2) 

for any positive real number x, and F(x) = for any nonpositive real number x. 
The square roots in (2) denote the principal branch, and Im takes the imaginary 
part. 
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Remark 2.2. The absolute value of the expression under the square root in (2) 
is always greater than ^Jn/{n + 1). Therefore, 



for any t <S (0, oo) and any x £ (0,oo). Thus, the integrand in (2) is never large 
for t e (0, oo). 

Remark 2.3. An efficient means of evaluating (2) numerically is to use adap- 
tive Gaussian quadratures (see, for example, Section 4.7 of Press et al., 2007). 
To attain double-precision accuracy (roughly 15-digit precision), the domain of 
integration for t in (2) need be only (0,40) rather than the whole (0, oo). Good 
choices for the lowest orders of the quadratures used in the adaptive Gaus- 
sian quadratures are 10 and 21, for double-precision accuracy. (See Section 3 of 
Perkins, Tygert, and Ward, 2011b, for the details.) 

3. The simple statistic 

This section details the simple root-mean-square statistic discussed briefly in 
Section 1, determining its probability distribution in the limit of large numbers 
of draws, assuming that the draws do in fact come from the specified model. The 
distribution determined in this section yields the confidence levels (in the limit 
of large numbers of draws) : Given a value x for the root-mean-square statistic 
constructed from i.i.d. draws coming from an unknown distribution, and given 
the value of the maximum-likelihood estimate 9 for the parameter of the dis- 
tribution, the confidence level that the draws do not come from the specified 
model is the probability that the root-mean-square statistic is less than x when 
constructed from i.i.d. draws that do come from the model distribution associ- 
ated with the parameter 9. (Please note that the definition in (5) and (6) below 
of the simple statistic involves the maximum-likelihood estimate 9. Maximum 
likelihood is the canonical method for parameter estimation, and is the focus of 
the present paper. See formulae (16) and (17) below regarding likelihood and 
maximum- likelihood estimation.) 

3.1. The distribution of the goodness- of -fit statistic 

To begin, we set notation and form the goodness-of-fit statistic X to be analyzed. 
Given n bins, numbered 1, 2, . . . , n — 1, n, we denote by p\(9), P2(9), ■ • ■ , 
Pn-i(#) ; Pn(0) the probabilities associated with the respective bins under the 
specified model, where 9 is a real number parameterizing the model; of course, 




(3) 




(4) 
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for any parameter 9. In order to obtain a draw conforming to the model for a 
particular value of 9, we select at random one of the n bins, with probabilities 
Pi(6), P2{9), •••) Pn-i(Q), Pn{0). We perform this selection independently m 
times. For k = 1, 2, n— 1, n, we denote by Yfc the fraction of times that we 
choose bin k (that is, Y k is the number of times that we choose bin k, divided 
by m); obviously, Ylk=i = 1- We define Xk to be ^frn times the difference 
of Yfc from its expected value using the maximum-likelihood estimate 9 of the 
actual parameter 9, that is, 

X k = V^(Y k -p k (§)) (5) 
for k = 1, 2, . . . , n — 1, n. Finally, we form the statistic 

* = (6) 

fc=i 

and now determine its distribution in the limit that the number m of draws is 



large. (The root-mean-square statistic y 52*=i ( TO ^fc — m Pk(6)) 2 / 'm is the square 
root of X. As the square root is a monotonically increasing function, the confi- 
dence levels are the same whether determined via X or via \[X\ for convenience, 
we focus on X below.) 

Remark 3.1. The classic % 2 test for goodness-of-fit of Pearson (1900) re- 
places (6) with the statistic 

* 2 = E^ (7) 

k=1 Pk{9) 

where X\, X%, X n _\, X n are the same as in (5) and (6), and 9 is the 
maximum-likelihood estimate of the parameter. 

For dcfinitcncss, we will be assuming that pi, p2, ■ ■ ■ , Pn-Xi Pn are diffcren- 
tiable as functions of the parameter 9, that the maximum of the likelihood occurs 
in the interior of the domain for 9, that the maximum-likelihood estimate 9 is 
almost surely the correct value for the actual parameter 0asra->oo, and that 
the variance of 9 tends to zero asm->oo (thus 9 is not "random" in the limit of 
large numbers of draws). As detailed, for example, by Moore and Spruill (1975) 
and by Kendall et al. (2009) in a chapter on goodness-of-fit (see also Remark 3.3 
below), the multivariate central limit theorem then shows that the joint distri- 
bution of X\, X2, • ■ • , X n —i, X n converges in distribution asm-> 00, with the 
limiting generalized probability density proportional to 



(8) 



where d is the Dirac delta, and 9 is the maximum-likelihood estimate of the 
parameter. 
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The generalized probability density in (8) is a centered multivariate Gaussian 
distribution concentrated on the intersection of two hyperplanes that both pass 
through the origin (the intersection of the hyperplanes consists of all the points 
such that X)fe=i Xk = an d Y^k=i x k jg ^ n (Pk(9))\ e _g = 0); the restriction of 
the generalized probability density (8) to the intersection of the hyperplanes 
is also a centered multivariate Gaussian. Thus, the distribution of X defined 
in (6) converges as m — > oo to the distribution of the sum of the squares of 
n — 2 independent Gaussian random variables of mean zero whose variances 
are the variances of the restricted multivariate Gaussian distribution along its 
principal axes (see, for example, Chapter 25 of Kendall ct al., 2009). Given 
these variances, Remark 2.3 describes an efficient algorithm for computing the 
probability that the associated sum of squares is less than any particular value; 
this probability is the desired confidence level, in the limit of large numbers of 
draws. For a detailed discussion, see Section 3.2 below. 

To compute the variances of the restricted multivariate Gaussian distribution 
along its principal axes, we perform the following four steps: 

1. Form an n x 2 matrix H whose columns both include a vector that is 
normal to the hyperplane consisting of the points (x±, X2, ■ ■ ■ , x n -i, x n ) 
such that 



J2- 



x k =0, (9) 

k=l 

and also include a vector that is normal to the hyperplane consisting of 
the points (xi, x%, . . . , x n —i, x n ) such that 



n , 

5> fe -ln(p fe (0)) 



dO 

k=l 



= 0, (10) 

8=6 



where 9 is the maximum-likelihood estimate of the parameter. For exam- 
ple, we can take the entries of H to be 

Hk ' j = { lHp k mU, 5 = 2 

for k = 1, 2, . . . , n — 1, n and j = 1, 2, where again 9 is the maximum- 
likelihood estimate of the parameter. 

2. Form an orthonormal basis for the column space of H , by constructing a 
pivoted QR decomposition 

H nX 2 — Qnx2 ■ R2x2 1 1^2x27 (12) 

where the columns of Q are orthonormal, R is upper-triangular, and IT is a 
permutation matrix. (See, for example, Chapter 5 of Golub and Van Loan, 
1996, for details on the construction of such a pivoted QR decomposition.) 

3. Form the n x n diagonal matrix D with the entries 
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for j, k = 1, 2, . . . , n — 1, n, where 9 is the maximum- likelihood estimate of 
the parameter. Then, multiply D from both the left and the right by the 
orthogonal projection (1 — QQ T ) onto the intersection of the hyperplanes 
consisting of the points satisfying (9) and (10), obtaining the nxn matrix 



B = (1-QQ T )D(1- 



(14) 



where 1 is the nxn identity matrix. 
4. Find the eigenvalues of the self-adjoint matrix B defined in (14). By con- 
struction, exactly two of the eigenvalues of B are zeros. The other eigen- 
values of B are the reciprocals of the desired variances of the restricted 
multivariate Gaussian distribution along its principal axes. 

Remark 3.2. The nxn matrix B defined in (14) is the sum of a diagonal matrix 
and a low-rank matrix. The methods of Gu and Eiscnstat (1994, 1995) for com- 
puting the eigenvalues of such a matrix B require only either 0(n 2 ) or 0(n) 
floating-point operations. Note that the 0(n 2 ) methods of Gu and Eiscnstat 
(1994, 1995) arc more efficient than the 0(n) procedure of Gu and Eisenstat 
(1995), unless n is impractically large. 

Remark 3.3. Under appropriate regularity conditions, it is easy to derive the 
homogeneous linear constraint — analogous to (10) — that 



fe=i 



kTd H Pk (0)) 



= o, 



(15) 



where 9 is the maximum-likelihood estimator. The following is a sketch of the 
proof of (15). 

To determine the maximum-likelihood estimate 9, we consider the likelihood, 
namely the multinomial distribution 



L{yi,y2, ■ ■ ■ ,y n -i,y n ,9) = mi [[—, r^— 

^ {my k )\ 



Maximizing (16) defines 9 via the formula 
O = ^ln(Z(Fi,y 2 ,...,y n _i,F„,0)) 
It follows from (4) that 



J2mY k -\n(p k (9)) 



k=l 



(16) 



(17) 



fc=i 



C^P*(*) = (18) 
Combining (17) and (18) yields 

= 0. (19) 



for any parameter 9, in particular for 9 = 
that 

n j 

£(n-j%(0))^Mp*(0)) 

k=l 



Combining (19) and (5) yields (15), as desired. 
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3.2. A procedure for computing the confidence levels 

An efficient method for calculating the confidence levels in the limit of large 
numbers of draws proceeds as follows. Given i.i.d. draws from any distribution 
- not necessarily from the model — we can form the associated statistic X 
defined in (6) and (5); in the limit of large numbers of draws, the confidence 
level that the draws do not arise from the model is then just the cumulative 
distribution function F(x) in (2) evaluated at x = X , with a\ in (2) obtained via 
Step 4 of the algorithm of Section 3.1 (after all, F(x) is the probability that x is 
greater than the sum of the squares of independent centered Gaussian random 
variables whose variances are given by Step 4 above). Remark 2.3 describes an 
efficient means of evaluating F(x) numerically. 

4. Numerical examples 

This section illustrates the performance of the algorithm of Section 3.2 via 
several numerical examples. 

Figure 1 and Table 1 correspond to the first example. The model distribution 
for the first example has 4 bins, with the probabilities indicated in Table 4. We 
will detail the interpretation of the figures and tables shortly. 

Figure 2 and Table 2 correspond to the second example. The model for the 
second example is the Zipf distribution on 100 bins. The row for Figure/Table 2 
in Table 4 provides a definition of the Zipf distribution. 

Figure 3 and Table 3 correspond to the third example. The model for the 
third example is the standard Poisson distribution. The row for Figure/Table 3 
in Table 4 provides a definition of the Poisson distribution. 

To test our algorithms, we conduct computational simulations. In every sim- 
ulation, we choose the number m of draws to be a very large number, namely 
m = 100,000. (The algorithms of the present paper concern the limit as m — > oo.) 
Part (a) of the examples uses j = 1,000 simulations; part (b) of the examples 
uses j = 10,000 simulations. The convergence (as j increases) of the plotted 
points to the straight line of unit slope through the origin provides numerical 
validation of our algorithms, for the following reasons. 

To create the plots, we run j simulations, each taking m = 100,000 i.i.d. 
draws from the model distribution with the specified parameter 6. For each 
simulation, we compute the statistic X defined in (6), forming Yi, Y2, 
Y n -i, Y n and 9 needed in (5) and (6) using the generated draws. We then 
compute the asymptotic confidence level associated with each of these values 
for X, as described in Section 3.2, and sort the resulting confidence levels. 
These sorted results are the vertical coordinates of the points in the plots; the 
horizontal coordinates are the equispaced numbers l/(2j), 3/(2j), (2j — 
3)/(2j), (2j-l)/(2j). 

As the number j of simulations increases, and insofar as the number m of 
draws is very large, the plotted points should converge to the straight line 
through the origin of slope 1 (and, indeed, our experiments demonstrate this). 
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The dotted line in each plot is the straight line through the origin of slope 1. 
The trials converge correctly: The root-mean-square statistics for about a% of 
the simulations should have P-values of a% or less, for every a £ (0, 100); in 
the limit that both the number m of draws and the number j of simulations 
are large, the computed P-values for exactly a% of the simulations should be 
less than or equal to a% (this follows from the definition of P-values; it also fol- 
lows from the fact that the confidence levels for the statistic X arc given by its 
cumulative distribution function F, and from the fact that F(X) is uniformly 
distributed over (0,1) for any random variable X distributed according to a 
continuous cumulative distribution function F). 

The following list describes the headings of the tables: 

• j is the number of simulations conducted in generating the associated plot. 

• 9 is the parameter for the model distribution used in generating the i.i.d. 
draws. 

• n is the number of bins/categories/cells/classes in the model (see Re- 
mark 4.2 regarding the Poisson distribution of the third example). 

• / is the maximum number of quadrature nodes required to evaluate the 
confidence level for any of the j root-mean-square statistics produced by 
the simulations. 

• t is the total number of seconds required to perform the quadratures for 
evaluating the confidence levels for all j of the root-mean-square statistics 
produced by the simulations. 

• s is the total number of seconds required to perform all j simulations. 

• pk{0) is the probability associated with bin k (k = 1, 2, . . . , n — 1, n), as 
a function of the parameter 9. 

• 9(Yi, Y27 • ■ • j Y n _i, Y n ) is the maximum-likelihood estimate of the param- 
eter 9, as a function of the fractions Yi, Y-x, ■ ■ ■ , Y n _i, Y n of the draws in 
the respective bins (Section 3 provides a detailed definition of Yi, Y2, . . . , 

Yn — 1 , Y n ) . 

We used Fortran 77 and ran all examples on one core of a 2.2 GHz Intel 
Core 2 Duo microprocessor with 2 MB of L2 cache. Our code is compliant with 
the IEEE double-precision standard (so that the mantissas of variables have ap- 
proximately one bit of precision less than 16 digits, yielding a relative precision 
of about 2E-16). We diagonalized the matrix B defined in (14) using the Jacobi 
algorithm (see, for example, Chapter 8 of Golub and Van Loan, 1996), not tak- 
ing advantage of Remark 3.2. We generated the pseudorandom numbers used in 
the simulations via (Mitchell-Moore-Brcnt-Knuth) lagged Fibonacci sequences 
(see, for example, Section 7.1.5 of Press et al., 2007). 

Observation 1. It is easy to compute the confidence levels (in the limit of large 
numbers of draws) for a distribution having infinitely many bins, but only to 
any arbitrary accuracy that is greater than the machine precision. Specifically, 
given a fully specified model distribution and an extremely small positive real 
number e, we would retain the smallest possible number of bins whose associated 
probabilities pi, p%, ■ ■ ■ , p n -i, Pn satisfy pi+p2 + - ■ ■+Pn-i+Pn > 1~ £, an d then 
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proceed with the computation as if these finitely many were the only bins. When 
there is a parameter 9 being estimated, we observe that the maximum-likelihood 
estimate 9 typically has variance zero and is almost surely correct in the limit 
of large numbers of draws; thus, as before, we may retain the smallest possible 
number of bins whose associated probabilities Pi{9), P2(9), • ■ • , Pn-i(O), Pn{6) 
satisfy pi{9) +P2{9) + ••• + p n -i{&) + Pn{8) > 1 — e, and then proceed with the 
computation as if these finitely many were the only bins. (Needless to say, if 
the fraction of the experimental draws falling outside the finitely many retained 
bins is significantly greater than e, then we can be highly confident that the 
draws did not arise from the model.) 

Remark 4.1. For the second example (the Zipf distribution), we computed the 
maximum- likelihood estimate 9 from the data Y\, Y%, . . . , Y n —\, Y„ by finding 
the zero of the function g(9) — f(9) — J2fc=i ^fe ln(fc) = 0, where / is the same 
as in Table 4, namely f{6) = fe=i k~ 8 ln(fc)) / (j2l=i k~ § ) ■ We evaluated 

the zero 9 numerically, via bisection (see, for example, Chapter 9 of Press et al., 
2007). 

Remark 4.2. For the third example (the Poisson distribution), we employed 
Observation 1, with e = 10 -8 . 



5. Conclusion 

This paper provides efficient black-box algorithms for computing the confidence 
levels for one of the simplest, most natural goodness-of-fit statistics, in the 
limit of large numbers of draws. Although the present paper focuses on fam- 
ilies of probability distributions parameterized with a single scalar (and the 
predecessor to this article focuses on fully specified distributions), our meth- 
ods extend straightforwardly to any parameterization with multiple scalars (or, 
equivalcntly, to any parameterization with a vector). Furthermore, our methods 
can handle arbitrarily weighted means in the root-mean-square, in addition to 
the usual, uniformly weighted average considered above. 

There are many advantages to using the simple root-mean-square, as shown 
by Perkins, Tygert, and Ward (2011a). With the now widespread availability of 
computers, calculating the relevant P-values via Monte-Carlo simulations can 
be feasible; the algorithms of the present paper can also be suitable, and are 
efficient and easy-to-use. 
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0.2 0.4 0.6 0.8 1 

(b) 

Fig. 1:2x2 contingency-table/cross-tabulation of Table 4 



Table 1 
Values for Figure 1 

j 8 n I t s 

"(a) 10 3 ^3 4 190 .43E0 .44E1 
(b) 10 4 .03 4 190 .43E1 .45E2 
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0.2 0.4 0.6 0.8 1 

(b) 

Fig. 2: Zipf distribution of Table 4 



Table 2 
Values for Figure 2 

j 8 n I t s 

(a) W 3 1 100 350 .92E1 .13E2 

(b) 10 4 1 100 390 .11E3 .13E3 
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0.2 0.4 0.6 0.8 1 

(b) 

Fig. 3: Poisson distribution of Table 4 



Table 3 
Values for Figure 3 

j 9 n I t s 

(a) 10 3 1CU 36 290 .37E1 .86E1 

(b) 10 4 10.3 36 330 .37E2 .86E2 
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Table 4 

Values for Figures 1-3 and Tables 1-3 



Fig. /Table # p*(£) 9{Y 1 , Y 2 , . . ■ , y n -i, Y n ) 

Fig /Tabic 1 Pl = M - 9 ' P2 = m{1 ~ e) e-Y+Y, 
t ig./ laDie i ^ = g6 0) ^ = 96(1 _ 6)) » - n + r 3 

Fig./Tablc3 p fc = e -"0 fc - 1 /(fe-l)! ^E^i^" 1 )^ 
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